Preserve pi-acp model metadata through LiteLLM proxy by bingran-you · Pull Request #803 · benchflow-ai/benchflow

bingran-you · 2026-06-18T07:14:09Z

Summary

copy BENCHFLOW_PROVIDER_MODELS metadata onto the LiteLLM proxy alias for pi-acp
preserve maxTokens/contextWindow when pi-acp is routed through usage-tracked LiteLLM
add a regression test for vLLM/Qwen metadata aliasing

Verification

uv run python -m py_compile src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py
uv run pytest tests/test_litellm_runtime.py -q (blocked locally: missing acp package in fresh uv env)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eeec75bb7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T07:17:42Z


+@pytest.mark.asyncio
+async def test_pi_acp_proxy_preserves_provider_model_metadata(monkeypatch):
+    """Pi sees the LiteLLM alias, so metadata must be copied to that id."""


Name the regression guard in the test docstring

This new test is the regression coverage for the Pi/LiteLLM aliasing fix, but its docstring does not identify the PR or commit it guards. The root AGENTS.md explicitly requires regression tests to name the PR/commit they guard, so please include this commit/PR identifier in the docstring to preserve the maintenance context.

Useful? React with 👍 / 👎.

greptile-apps · 2026-06-18T07:18:26Z

Greptile Summary

This PR adds metadata propagation for pi-acp agents routed through the LiteLLM proxy: when BENCHFLOW_PROVIDER_MODELS is present, the entry matching the original model is cloned under the proxy alias ID/name so pi-acp can still resolve maxTokens/contextWindow after the model name changes.

_provider_models_for_proxy_alias searches for a matching entry via a set of model-ID variants (requested_model, upstream_model, and their prefix-stripped forms), clones it with the alias id/name, and appends it only if the alias does not already exist.
The pi-acp branch in _wire_litellm_agent_env calls this helper and writes the result to BENCHFLOW_PROVIDER_MODELS when a match is found.
A new @pytest.mark.asyncio test exercises the vLLM/Qwen path end-to-end, asserting the alias id, name, maxTokens, and contextWindow.

Confidence Score: 5/5

The change is safe to merge: it adds a well-scoped helper with a narrow code path activated only for pi-acp, and leaves all other agents untouched.

The new helper is purely additive, guarded by a pi-acp branch already present in _wire_litellm_agent_env, and its only side-effect is appending an alias entry to the serialized model list — a no-op if the alias is already present. Direct assignment (not setdefault) is used for both id and name, and the test now asserts all three key fields on the alias entry. No pre-existing logic is modified.

No files require special attention.

Important Files Changed

Filename	Overview
src/benchflow/providers/litellm_runtime.py	New _provider_models_for_proxy_alias helper and pi-acp wiring look correct; direct assignment of id/name (not setdefault) ensures the alias entry is always consistent.
tests/test_litellm_runtime.py	New regression test covers the vLLM/Qwen alias path and now asserts alias["name"], maxTokens, and contextWindow — adequate coverage for the happy path.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant ensure_litellm_runtime
    participant _wire_litellm_agent_env
    participant _provider_models_for_proxy_alias
    participant LiteLLMProxy

    Caller->>ensure_litellm_runtime: "agent=pi-acp, model=vllm/Qwen/Qwen3-4B, BENCHFLOW_PROVIDER_MODELS=[{id:Qwen/Qwen3-4B,...}]"
    ensure_litellm_runtime->>_wire_litellm_agent_env: "agent_env, route(alias=benchflow-vllm-Qwen-Qwen3-4B)"
    _wire_litellm_agent_env->>_provider_models_for_proxy_alias: "raw=BENCHFLOW_PROVIDER_MODELS, route"
    note over _provider_models_for_proxy_alias: Builds wanted set, finds entry id=Qwen/Qwen3-4B, clones with id=name=alias
    _provider_models_for_proxy_alias-->>_wire_litellm_agent_env: JSON with original + alias entry
    _wire_litellm_agent_env-->>ensure_litellm_runtime: updated env (BENCHFLOW_PROVIDER_MODELS updated)
    ensure_litellm_runtime-->>Caller: (updated_env, provider_runtime)
    Caller->>LiteLLMProxy: "Uses BENCHFLOW_PROVIDER_MODEL=benchflow-vllm-Qwen-Qwen3-4B"
    LiteLLMProxy-->>Caller: Resolves maxTokens/contextWindow via alias entry

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant ensure_litellm_runtime
    participant _wire_litellm_agent_env
    participant _provider_models_for_proxy_alias
    participant LiteLLMProxy

    Caller->>ensure_litellm_runtime: "agent=pi-acp, model=vllm/Qwen/Qwen3-4B, BENCHFLOW_PROVIDER_MODELS=[{id:Qwen/Qwen3-4B,...}]"
    ensure_litellm_runtime->>_wire_litellm_agent_env: "agent_env, route(alias=benchflow-vllm-Qwen-Qwen3-4B)"
    _wire_litellm_agent_env->>_provider_models_for_proxy_alias: "raw=BENCHFLOW_PROVIDER_MODELS, route"
    note over _provider_models_for_proxy_alias: Builds wanted set, finds entry id=Qwen/Qwen3-4B, clones with id=name=alias
    _provider_models_for_proxy_alias-->>_wire_litellm_agent_env: JSON with original + alias entry
    _wire_litellm_agent_env-->>ensure_litellm_runtime: updated env (BENCHFLOW_PROVIDER_MODELS updated)
    ensure_litellm_runtime-->>Caller: (updated_env, provider_runtime)
    Caller->>LiteLLMProxy: "Uses BENCHFLOW_PROVIDER_MODEL=benchflow-vllm-Qwen-Qwen3-4B"
    LiteLLMProxy-->>Caller: Resolves maxTokens/contextWindow via alias entry

_{Reviews (3): Last reviewed commit: "Preserve pi-acp model metadata through L..." | Re-trigger Greptile}

greptile-apps · 2026-06-18T07:18:34Z

+        alias_entry = dict(entry)
+        alias_entry["id"] = route.model_alias
+        alias_entry.setdefault("name", route.model_alias)


When the original entry already has a "name" key — as in the test fixture where "name": "Qwen/Qwen3-4B" is present — setdefault is a no-op, so the alias entry ends up with id = "benchflow-vllm-Qwen-Qwen3-4B" but name = "Qwen/Qwen3-4B". If pi-acp uses name for display or any secondary lookup this creates a stale/inconsistent value. Using a direct assignment ensures the alias entry's name always reflects the alias ID.

Suggested change

alias_entry = dict(entry)

alias_entry["id"] = route.model_alias

alias_entry.setdefault("name", route.model_alias)

alias_entry = dict(entry)

alias_entry["id"] = route.model_alias

alias_entry["name"] = route.model_alias

greptile-apps · 2026-06-18T07:18:35Z

+            route=route,
+        )
+        if alias_models:
+            updated["BENCHFLOW_PROVIDER_MODELS"] = alias_models


Silent failure when no entry matches

If _provider_models_for_proxy_alias returns None (e.g., BENCHFLOW_PROVIDER_MODELS is set but none of its entries have an id matching any variant in wanted), updated["BENCHFLOW_PROVIDER_MODELS"] is left unchanged — keeping the original entry list that does not contain the alias ID. Pi-acp would then receive a model list without the proxied alias, causing the same lookup failure this PR aims to fix, with no log to indicate the miss. Adding a debug/warning log on the None return paths in _provider_models_for_proxy_alias would make this much easier to diagnose in production.

greptile-apps · 2026-06-18T07:18:36Z

+    assert provider_runtime is not None
+    assert updated["BENCHFLOW_PROVIDER_MODEL"] == "benchflow-vllm-Qwen-Qwen3-4B"
+    models = json.loads(updated["BENCHFLOW_PROVIDER_MODELS"])
+    alias = next(m for m in models if m["id"] == "benchflow-vllm-Qwen-Qwen3-4B")
+    assert alias["maxTokens"] == 1024
+    assert alias["contextWindow"] == 16384


name field of alias entry is not asserted

The test confirms maxTokens and contextWindow are preserved on the alias entry, but does not assert alias["name"]. Because setdefault is a no-op when "name" already exists in the source entry, the alias entry currently carries name = "Qwen/Qwen3-4B" rather than the alias "benchflow-vllm-Qwen-Qwen3-4B". Adding assert alias["name"] == "benchflow-vllm-Qwen-Qwen3-4B" would lock in the expected behaviour and catch a regression if the name handling changes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

github-actions · 2026-06-18T23:45:32Z

Integration final review

Final verdict: not mergeable

Verdict

not mergeable

Blockers

unhealthy slot: data-to-d3-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])
unhealthy slot: weighted-gdp-calc-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])

Coverage

cell	task	agent	sandbox	skill_mode	status	detail
jax-computing-basics-docker-no-skill-openhands	jax-computing-basics	openhands	docker	no-skill	healthy	all gates green
jax-computing-basics-docker-no-skill-pi-acp	jax-computing-basics	pi-acp	docker	no-skill	healthy	all gates green
jax-computing-basics-docker-no-skill-opencode	jax-computing-basics	opencode	docker	no-skill	healthy	all gates green
data-to-d3-docker-no-skill-openhands	data-to-d3	openhands	docker	no-skill	unhealthy	deterministic reject: ['V-TAMPER']
data-to-d3-docker-no-skill-pi-acp	data-to-d3	pi-acp	docker	no-skill	healthy	all gates green
data-to-d3-docker-no-skill-opencode	data-to-d3	opencode	docker	no-skill	healthy	all gates green
weighted-gdp-calc-docker-no-skill-openhands	weighted-gdp-calc	openhands	docker	no-skill	unhealthy	deterministic reject: ['V-TAMPER']
weighted-gdp-calc-docker-no-skill-pi-acp	weighted-gdp-calc	pi-acp	docker	no-skill	healthy	all gates green
weighted-gdp-calc-docker-no-skill-opencode	weighted-gdp-calc	opencode	docker	no-skill	healthy	all gates green
citation-check-docker-no-skill-openhands-allowlist	citation-check	openhands	docker	no-skill	healthy	all gates green

Slots: healthy=8, unhealthy=2, missing=0, duplicate=0, stale=0 (planned=10)

Evidence

jax-computing-basics-docker-no-skill-openhands (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840 --json
jax-computing-basics-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805 --json
jax-computing-basics-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5 --json
data-to-d3-docker-no-skill-openhands (unhealthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7 --json
data-to-d3-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db --json
data-to-d3-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d --json
weighted-gdp-calc-docker-no-skill-openhands (unhealthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277 --json
weighted-gdp-calc-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a --json
weighted-gdp-calc-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555 --json
citation-check-docker-no-skill-openhands-allowlist (healthy)
- root: jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae --json

Parity:

pinned-baseline pinned-baseline: fail — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare

Residual risk

QUARANTINE: parity pinned-baseline (advisory — gate needs native-baseline mode): pinned-baseline — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare
residual (from plan): network lane carries network_mode as EXPECTED only; the allowlist variant is not passed to bench (no --network flag exists)
V-LIFECYCLE / V-ENVHARDEN / V-REWARDHACK: codex/residual review (never faked deterministically)

Required reruns

rerun cell: data-to-d3-docker-no-skill-openhands
rerun cell: weighted-gdp-calc-docker-no-skill-openhands

bingran-you · 2026-06-19T05:43:53Z

Automation update (2026-06-19): pushed 212d29b6 to fix the stale ty check failure on eeec75bb.

What changed:

Typed the provider-model metadata lookup so uv run ty check src passes again.
Set the copied alias entry's name to the LiteLLM proxy alias as well as id, and added a regression assertion.
Updated the regression-test docstring to name PR Preserve pi-acp model metadata through LiteLLM proxy #803 per repo convention.

Local verification after the patch:

uv sync --extra dev --extra sandbox-daytona --locked
uv run pytest tests/test_litellm_runtime.py -q -> 20 passed
uv run ruff check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
uv run ruff format --check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
uv run ty check src -> pass

Current GitHub checks: normal test and pip-audit are green on the new head; the Docker rollout-smoke is still pending, so I am not marking this merge-ready yet. The older final-review V-TAMPER report was on the previous head and should be treated as stale until current-head review completes.

bingran-you · 2026-06-19T05:45:44Z

Follow-up (2026-06-19): the pending Docker rollout-smoke on head 212d29b6 has now passed as well. Current-head checks are green for test, pip-audit, detect-scope, and rollout-smoke. Holding under review:pending for non-author human review before merge.

…ositive, codex robustness (#814) * fix(integration): calibrate L3 gate — slot matching, V-TAMPER, codex robustness Three defects made the L3 final-review gate systematically return "not mergeable" for reasons unrelated to the PR under review (found running the gate end-to-end on #794/#803/#813): 1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and its `-allowlist` network variant collided into one slot (one "duplicate", one "missing"); in-place agent retries also each counted as a separate rollout. Now attribute each rollout by its cell-id directory (the run-matrix writes <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest attempt. Genuinely double-scheduled cells still surface as duplicates. 2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched the whole title; OpenHands writes "<description>: $ <command>", and prose like "Verify the output" collided with the `verif` token, falsely flagging benign cleanup/verification as verifier tampering. Scan only the command after "$ "; a real tamper command is still caught. 3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox) to read the review pack from disk; a transient bwrap loopback failure killed every read -> no verdict -> false fail-closed. Inline the review-pack files into the prompt so no sandboxed read is needed, plus a bounded retry on transient (sandbox/network/rate-limit) output. A dead codex still fails closed. Adds regression tests for all three. * style(integration): ruff format the new gate-calibration tests * fix(integration): tighten greptile P2s on the L3 gate fixes - codex transient markers: drop the bare '429' (matched 'line 429'); use the 'too many requests' reason phrase + 'http 429' instead. - _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level segments can never coincidentally match a cell id. - _started_at: fall back to finished_at so retry-collapse stays chronological when started_at is absent. * fix(deps): bump pydantic-settings 2.14.1 -> 2.14.2 (GHSA-4xgf-cpjx-pc3j) A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published 2026-06-19) started failing pip-audit repo-wide on every PR and main — pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the locked version to the fixed 2.14.2, preserving the revision-1 lockfile format (no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean. --------- Co-authored-by: symphony-bot <symphony@benchflow.ai>

Mirror BENCHFLOW_PROVIDER_MODELS metadata (maxTokens/contextWindow) onto the LiteLLM proxy alias that pi-acp sees, so model limits survive the mandatory proxy routing introduced in #820. Rebased onto main after #820 restructured litellm_runtime into _apply/_wire_litellm_agent_env. - Add _provider_models_for_proxy_alias + _provider_model_id helpers. - Clone the source model entry under route.model_alias in the pi-acp branch of _wire_litellm_agent_env. - Regression test asserts vLLM/Qwen alias carries maxTokens/contextWindow. Local: tests/test_litellm_runtime.py 23 passed; ruff + ty clean.

bingran-you · 2026-06-22T06:08:48Z

Automation update (2026-06-22): rebased onto current main to resolve the conflict introduced by #820 (which made the LiteLLM proxy mandatory and restructured litellm_runtime into _apply_litellm_agent_env/_wire_litellm_agent_env).

The change is still purely additive and not subsumed by #820: #820 made the proxy mandatory but does not mirror BENCHFLOW_PROVIDER_MODELS metadata onto the proxy alias. The pi-acp metadata-mirroring (maxTokens/contextWindow) is re-applied in the new _wire_litellm_agent_env pi-acp branch via _provider_models_for_proxy_alias.

Local verification on the rebased head:

uv run python -m pytest tests/test_litellm_runtime.py -q → 23 passed
uv run ruff check + ruff format --check + uv run ty check → all clean

Holding at review:pending for non-author human review. Note: the prior integration-final-review blocker (V-TAMPER on data-to-d3 / weighted-gdp-calc cells) was a reward-hacking flag in the heavy eval gate's sampled cells, unrelated to this env-metadata diff; it will re-run on the rebased head.

bingran-you · 2026-06-22T06:17:48Z

Follow-up (2026-06-22): post-rebase checks are now green and mergeStateStatus is CLEAN/MERGEABLE on head d1bbb5c8 (test, pip-audit, detect-scope, rollout-smoke pass; matrix/final-review skipped by scope for this narrow env-metadata diff). Ready for a non-author review → merge. Not self-merging since this touches the LiteLLM proxy env-wiring path; a quick reviewer/greptile pass on the rebased head is the only remaining gate.

bingran-you temporarily deployed to pypi-internal-preview June 18, 2026 07:14 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 18, 2026

View reviewed changes

This was referenced Jun 18, 2026

fix(integration): repair the full L0–L3 workflow + ready-to-merge codex auto-trigger #806

Merged

fix(integration): add uv.lock to plan-job sparse-checkout (setup-uv cache) #807

Merged

bingran-you temporarily deployed to pypi-internal-preview June 19, 2026 05:37 — with GitHub Actions Inactive

Yiminnn mentioned this pull request Jun 19, 2026

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness #814

Merged

bingran-you force-pushed the bry/pi-acp-proxy-model-metadata branch from 212d29b to d1bbb5c Compare June 22, 2026 06:08

bingran-you temporarily deployed to pypi-internal-preview June 22, 2026 06:08 — with GitHub Actions Inactive

bingran-you merged commit 57f078d into main Jun 22, 2026
8 checks passed

bingran-you deleted the bry/pi-acp-proxy-model-metadata branch June 22, 2026 06:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve pi-acp model metadata through LiteLLM proxy#803

Preserve pi-acp model metadata through LiteLLM proxy#803
bingran-you merged 1 commit into
mainfrom
bry/pi-acp-proxy-model-metadata

bingran-you commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Uh oh!

greptile-apps Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 22, 2026

Uh oh!

bingran-you commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bingran-you commented Jun 18, 2026

Summary

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Integration final review

Verdict

Blockers

Coverage

Evidence

Residual risk

Required reruns

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 22, 2026

Uh oh!

bingran-you commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 18, 2026 •

edited

Loading

github-actions Bot commented Jun 18, 2026 •

edited

Loading